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COMPUTING ACCURATE HORNER FORM APPROXIMATIONS TO 
SPECIAL FUNCTIONS IN FINITE PRECISION ARITHMETIC 


TOR MYKLEBUST 

Abstract. In various applications, computers are required to compute approximations to 
univariate elementary and special functions such as exp and arctan to modest accuracy. 
This paper proposes a new heuristic for automating the design of such implementations. 
This heuristic takes a certain restricted specification of program structure and the desired 
error properties as input and takes explicit account of roundoff error during evaluation. 


1. Introduction 

Various programming language standards, such as the Java Language Specification [2] re¬ 
quire that a certain set of elementary and special functions (in this paper simply “mathemat¬ 
ical functions”) be provided for their native floating-point types and that their implementa¬ 
tions are faithfully-rounded —that is, that they produce one of the two machine-representable 
numbers bracketing the exact, real-number value. 

Projects such as LIBULTIM |20| and CRlibm [6] provide a wide array of correctly-rounded 
functions—that is, they always produce the closest machine-representable number to the 
exact function value. CRlibm provides correctly-rounded functions in all four rounding 
modes specified by the IEEE 754 floating-point arithmetic standard. The fdlibm library, 
developed by Sun, is a portable, widely-used math library that delivers faithfully-rounded 
results for a number of important functions. 

LIBULTIM and CRlibm make extensive use of both of table lookups and of conditional 
branches in their function implementations. For most functions, the fdlibm library splits the 
function’s domain into several intervals and uses a different approximation on each interval. 

If one wishes to compute the same mathematical function on several different inputs at 
once, it is natural to consider using the fast vector units that are widespread in modern 
computers. It is often difficult to achieve a significant speedup using a vector unit with code 
that is rich in table lookups and conditional branches since different entries of the vector can 
result in different table accesses or different code paths. In several models of processor power 
consumption, using a vector instruction in the place of a sequence of scalar instructions can 
also confer a significant power savings [TBJ ID] . 

Furthermore, mathematical function evaluation is responsible for a substantial fraction of 
execution time and power use in some applications. [Tfi] gives a brief computational study of 
two applications in high-energy physics where faster but less accurate mathematical functions 
lead to physically acceptable results within a substantially shorter timeframe. 
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Recent projects such as SLEEF [19] and Yeppp [7] provide general-purpose, fast, and 
reasonably accurate mathematical function implementations. These libraries allow a pro¬ 
grammer to compute the same function applied to a vector of floating-point numbers con¬ 
siderably faster than by applying a traditional implementation to each element in sequence. 
The mathematical function implementations in these libraries consist of a simple argument 
reduction that is easy to vectorise, then a polynomial evaluation, then a reconstruction step 
that is also easy to vectorise. Both of these libraries focus on delivering results very quickly 
at the expense of a significantly weaker guarantee on the error between the delivered result 
and the mathematical result. 

The work in this paper is motivated by a desire to improve the error bounds achievable 
within the algorithmic framework represented by SLEEF and Yeppp without sacrificing the 
speed of the resulting implementation. 

Fundamental work on the problem of computing good machine approximations to func¬ 
tions has been done previously. 1 mention only the following two recent papers. 

Brisebarre, Muller, and Tisserand [1] give an algorithm based on enumerating lattice points 
inside a polyhedron for hireling the polynomial with machine-representable coefficients that 
best approximates, when evaluated in real arithmetic, a given function. 

Brisebarre and Chevillard [3j give a heuristic based on lattice basis reduction for computing 
a polynomial with machine-representable coefficients that approximates a given function well 
when evaluated in real arithmetic. 


2. Overview 

This paper gives a heuristic that tries to find polynomials with machine-representable 
coefficients that approximate a given function well when evaluated in machine arithmetic. 
It takes as input a partially-specified straight-line program of fused multiply-adds (in a 
certain Horner-like form), an interval of interest, and a function that gives, for each machine- 
representable number in said interval, the range of acceptable function values. The heuristic 
either fails or produces as output a fully-specified straight-line program of fused multiply- 
adds that produces an acceptable function value for every machine-representable number in 
the specified interval. 

The heuristic presented in this paper can be understood as a refinement of Brisebarre, 
Muller, and Tisserand’s polyhedral approach. Importantly, this heuristic accounts for round¬ 
off error that occurs while evaluating the function. Further, this heuristic can take explicit 
account of any argument-reduction and reconstruction steps necessary in the mathemati¬ 
cal function implementation. The key recent advance in optimisation technology that makes 
this heuristic practical is the exact linear optimisation package QSopt_ex of Applegate, Cook, 
Dash, and Espinoza [T]. 

An implementation of the heuristic in this paper, together with a distribution of QSopt_ex, 
is available at http: //github. com/tmyklebu/f unapprox. 

3. Notation 

C99’s hexfloat notation is used extensively in this paper. In this notation, a finite, normal 
binary32 floating-point constant is written unambiguously in the form Oxl .mmmmmmp+eef. 
where Ox is literal, each m is a hexadecimal digit in the signiheand, p is literal, +ee is the 
exponent, and f is a literal suffix indicating that the number is binary32. See the recent Cll 
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|13j standard’s description of “hexadecimal floating-point constants” for a formal definition. 
As an example, Oxl. If 2p+lf is the constant 1 • 2 1 + 1 • 2 -3 + 15 • 2 -7 + 2 • 2 _n . 

“Ulp” is a shorthand for “unit in the last place.” Ignoring special cases, this is the 
difference between one floating-point number and the smallest floating-point number larger 
than it in magnitude. If x is a real number, then an “ulp of x”, written ulp a;, is the 
difference between the largest floatin g-p oint number less-than-or-equal-to x and the smallest 
floating-point number larger than x. □ 

4. A RUNNING EXAMPLE 

The following partially-specified C program will be used as a running example: 

float sin_poly(float a) { 
float s = a * a; 
float r5 = fmaf(s, c9, c7); 
float r4 = fmaf(s, r5, c5); 
float r3 = fmaf(s, r4, c3); 
float r2 = s * r3; 
float rl = fmaf(a, r2, a); 
return rl; 

} 

Here, fraaf is the “fused multiply-add” for binary32 numbers; fmaf (a, b, c) is the closest 
machine-representable number to ab + c. 

The program sin_poly above attempts to compute the Horner form 

(1) CL T CL ' Cl (C3 T CL (C5 T CL (C7 T Cl C9))) 

in binary32 arithmetic. (Note that simple multiplications such as s = a * a; may be 
computed as s = fmaf (a, a, O.Of) ; if only for the sake of uniformity.) 

It is desired that coefficients C3, C5, 07, and eg, each in [—1,1], are computed such that, 
for every binary32 number a between —7r/ 4 and 7 t/ 4, the value returned from the call 
sin_poly(a) is within 0.65 ulp of sin a. The choice of [—1,1] for every coefficient is arbitrary, 
but finite and reasonably small bounds on every coefficient are necessary so that the error 
bounds in the next section can give useful results. 

Two obstacles present themselves. First, each coefficient (C3, C5, C7, and eg) must be a 
binary32 number. Second, s, r^, r 4, 73, T2, and iq are evaluated in binary32 arithmetic—it 
is not enough that ([I]) is within 0.65 ulp of sin a for every a G [—7 t/4, 7t/4]. 

5. Error bounds 

The fused multiply-add (FMA) is available on many modern processors. The fused 
multiply-add computes ab + c, for machine numbers a, b , and c, with only a single rounding 
at the end. This calculation is written as fma (a,b,c). The lack of rounding of the interme¬ 
diate product ab often results in greater accuracy in a larger computation that makes use 

lr There are several other definitions of ulp x that differ when x is equal to or near a signed power of two. 
See 15| for an extended discussion of various definitions of “ulp.” Any reasonable definition will do for the 
purpose of understanding this paper. 
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of the FMA. The IEEE standard for floating-point arithmetic [12], as of 2008, requires that 
fma(a, b, c ) delivers the closest machine-representable number to ab + c. 

A very useful property of the FMA is that, as a univariate function of a, fma(a, b, c ) is a 
monotone (increasing or decreasing) function. Thus, given lower and upper bounds, say / 
and /, on fma(a, b, c), and values of b and c, one can use binary search to End, quickly and 
exactly, the interval of representable numbers a such that / < fma (a,b,c) < f. 

As mentioned, if a, b , and c are machine-representable numbers, the fused multiply-add 
fma(a, b, c ) is the closest machine-representable number to ab + c. Thus, absent exponent 
overflow or underflow, one can bound 

( 2 ) |fma(a, b, c ) — (ab + c)| < - ulp(afe + c). 

If a and c are machine approximations to real numbers a + da and c + 5c but b is a 
machine-representable number known exactly, one can use the triangle inequality to bound 

|fma(a, 6 , c) — ((a + Sa)b + (c + 5c))| 

(3) < | ulp(a 6 + c) + | Ac + bSa\ 

< j ulp(afe + c) + |5c| + \b5a\ . 

Thus, if one has bounds on ab + c, |5a|, and |5c|, one can compute an explicit bound on the 
difference between fma(a, b, c ) and the exact-arithmetic result (a + Sa)b + (c + Sc). 

Applying this bound naively (i.e. taking no account of possible cancellation) to the exam¬ 
ple of sin_poly (0 .5f), one can compute s = 0.25 and then End 

t 5 + Sr 5 = 0.25 c 9 + c 7 

with |5r s | < | ulp 1.25 = 2 " 25 and |r 5 | < 1.25 + 2" 25 . 

Then 

v 4 T Sr^ = C 5 T 0.25c 7 T 0.25 2 c 9 

with 

|5r 4 | < - ulp(1.3125 + 2 -25 ) + |0.255r 5 | < 2 ~ 25 + 2 ~ 27 
and |r 4 | < 1.3125 + 2 ~ 25 + 2“ 27 . 

Bounds on |5r 3 |, |r 3 |, |5r 2 |, |r 2 |, |5r 4 |, and |r 4 | can be computed analogously. 

Note also that, given a, bounds that must be satisfied by rq can be computed. For a = 
0 .5f, we desire that r 4 is within 0.65 ulp of sin Since sin | is roughly 0 x 1 . eaee8744b0p-2f, 
this means that we desire 

0 x 1 .eaee 86 p- 2 f < zq < 0 x 1 .eaee 88 p- 2 f. 

Note that these lower and upper bounds are adjacent binary32 numbers. In some other 
cases, such as a = 0.625f, these lower and upper bounds are equal. 

Since 7 q is computed using only r 2 and a, this implies bounds on r 2 . One can use binary 
search to find the smallest and largest binary32 numbers such that 

0 x 1 . eaee 86 p- 2 f < fma(a, r 2 ,a) < 0 x 1 . eaee 88 p- 2 f. 

This yields the bounds 

-0x1.5117aep-5f <r 2 < -0x1.511790p-5f. 
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These bounds are nonadjacent binary32 numbers. Since r 2 is computed using only r 3 and 
s, this implies bounds on r 3 : 

-Oxl.5117aep-3f < r 3 < -0x1.511790p-3f. 

Unfortunately, r 3 is computed using the coefficient c 3 , so similar bounds on r 4 cannot be 
obtained. However, if c 3 is fixed, this can be done. 

6. Formulating linear constraints 

Given a value of the abscissa a, the value of s can be computed directly and acceptable 
bounds on ry can be derived from the value of sin a. As in the previous section, upper and 
lower bounds on the difference between the computed value sin_poly(a) and the exact- 
arithmetic value of the Horner form ([1]) can be found. 

Suppose this difference is at least 5 and at most 5. Then the coefficients c 3r .. i 9 must satisfy 
the linear inequalities 

sin(a) — 0.65 ulp(sin(a)) + 5 

(4) < a + a ■ s(c 3 + s(c 5 + s(c 7 + sc 9 ))) 

< sin(a) + 0.65 ulp(sin(a)) + 5. 

An acceptable list of coefficients c 3) ... :9 must satisfy ((4j) for every a of interest. However, one 
need not formulate all such constraints in the beginning. One can generate the inequalities 
(J 3 J) for some small subset of the points of interest, find a solution, and then look for points 
a not yet considered for which (j3J) is violated. This is often called a cutting-plane method. 

Indeed, a classical theorem of Hclly [9] (see also B2I) implies that a (possibly very large) 
system of linear inequalities in n variables is either satisfiable or there exists an unsatisfiable 
subsystem of at most n + 1 inequalities. 

Given a list of linear inequalities that must be satisfied by some variables that take on real 
values, one can compute lower and upper bounds on the variables using linear optimisation. 

The linear optimisation problems that arise from (J4]) are very ill-conditioned. The left- 
hand side is a Vandermonde matrix and the bounds on each row of this Vandermonde, 
called 6_ and S in (HI) , are often very close together. Conventional inexact linear optimisation, 
as implemented in systems such as Gurobi J8] and IBM’s CPLEX HU, can give meaning¬ 
fully incorrect bounds and even confuse feasible systems of linear inequalities with infeasible 
systems. Therefore, fast, exact linear optimisation is necessary. 

Note also that, if c 3 is fixed to some value, different (likely tighter) linear inequalities on 
C5, c 7 , and eg may be formulated since, for each abscissa a, the interval of acceptable values 
of r 4 can now be derived. 


7. The heuristic 

The state of the heuristic has several components: 

• A nonempty list of test points. 

• Finite lower and upper bounds on each variable. 

The heuristic runs the following loop until something fails: 

(1) Find a solution c such that p c (x) is an acceptable rounding of f(x) for every test 
point x. 
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Fix an ordering of the coefficients [ci,..., c*,]. 
loop some number of times 
for i = 1 to k do 

If i = k, report success. 

Compute lower and upper bounds on q by exact linear optimisation—say q < q < 

C~t- 

If infeasible, break. 

Fix q to a randomly-chosen representable number between q and q. 

end for 
end loop 

Fail. 


Figure 1 . Pseudocode for Step 1 . 

(2) Find a point x such that p c (x) is not an acceptable rounding of f(x) and add x to 
the list of test points. 

(3) Go to 1. 

Step 1 is done roughly as in Figure [TJ but with some refinements described later. If step 
1 fails, the heuristic reports failure. This does not necessarily imply that the problem is 
infeasible. 

Step 2 is done by trying every abscissa of interest until at least one results in an unaccept¬ 
able function value. This is only practical for smaller domains, such as those arising from 
IEEE binary32 functions. 

Choosing a distribution other than the uniform distribution on [q,q] may result in better 
performance. I use the average of two uniform samples on [q, q]; better choices may exist. 

It is also wasteful to take only a single sample of q +1 ,..., C& after fixing q. The linear 
optimisation problems later in the for loop tend to solve faster than those earlier in the loop 
Further, a single bad coefficient choice later in the variable fixing process can scuttle a good 
choice of an earlier coefficient. Instead, I recursively try a constant number (four) of choices 
of q + 1 after fixing q. The first choice of q made is always the value of q that most recently 
yielded an acceptable list of coefficients. 

8 . Examples 

This section gives examples of C functions that compute arctan on [—1,1] and on (—oo, oo) 
using fused multiply-add. The arctan function was chosen because it admits an especially 
simple argument reduction to [—1,1], yet some care must be taken to get a faithfully-rounded 
result on that interval. 

8.1. Arctangent on [—1,1]. The C function in Figure[2]is a faithfully-rounded approxima¬ 
tion to arctanx for x a binary32 number in [—1,1], On every binary32 number a in [—1,1], 
atan_poly(a) produces a result that differs from arctan a by less than 0.95 ulp. This was 
computed in about five minutes from the partial straight-line program in Figure [3j fixing 
variables in the order c 3 , c 5 , c 7 , c 9 , Cn, q 3 , q 5 , q 7 . 

By way of comparison, SLEEF’s implementation xatanf has a maximum error of roughly 
1.773 ulp on [—1,1], which drops to roughly 1.707 ulp if mlaf is replaced by fmaf. 
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float atan_poly(float a) { 
float s = a * a; 
float r = Oxl.6d2026p-9f; 


r = fmaf(r. 

s, 

-0x1.03f2d4p-6f); 

r = fmaf(r. 

s. 

0x1.5beeb4p-5f); 

r = fmaf(r. 

s. 

-0x1.33194ep-4f); 

r = fmaf(r. 

s. 

0xl.b403a8p-4f); 

r = fmaf(r. 

s. 

-0x1.22f5c2p-3f); 

r = fmaf(r. 

s. 

0x1.997748p-3f); 

r = fmaf(r. 

s. 

-0xl.5554d8p-2f); 


r = r * s; 

return fmaf(r, a, a); 

} 


Figure 2. A faithful approximation to arctan on [—1,1]. 
float atan_poly(float a) { 


float s = a 

* 

a; 

float r = cl7; 
r = fmaf(r, s. 

cl5) ; 

r = fmaf(r, 

s, 

cl3); 

r = fmaf(r, 

s, 

ell); 

r = fmaf(r, 

S, 

c9) ; 

r = fmaf(r, 

S, 

c7); 

r = fmaf(r, 

S, 

c5); 

r = fmaf(r, 

S, 

c3); 

r = r * s; 
return fraaf( 

r, 

a, a) ; 


} 


Figure 3. A partially-specified approximation to arctan. 

The Sollya tool [5], when given input 

fpminimax(atan(x), [|1,3,5,7,9,11,13,15,171], 
[|24,24,24,24,24,24,24,24,24|], [le-6, 1]); 

produces the coefficients given in Figure SI Sollya’s fpminimax uses an implementation 
of the method of Brisebarre and Chevillard. When the coefficients of this polynomial are 
substituted into Figure SI the resulting function has a maximum error of roughly 1.067 ulp. 
The proposed heuristic therefore gives an improvement in the maximum error of about 0.117 
ulp over Sollya. 

8.2. Arctangent everywhere. The proposed approach allows one to specify the function 
to be approximated by means of a function taking an argument x and returning the range 
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float cl7 = 2.90188402868807315826416015625e-3f; 
float cl5 = -1.62907354533672332763671875e-2f; 
float cl3 = 4.3082617223262786865234375e-2f; 
float ell = -7.5408883392810821533203125e-2f; 
float c9 = 0.1066047251224517822265625f; 
float c7 = -0.14209578931331634521484375f; 
float c5 = 0.19993579387664794921875f; 
float c3 = -0.3333314359188079833984375f; 


Figure 4. Sollya’s coefficients for approximating arctan on [—1,1], 

float juffa_atanf(float a) { 
float r, t; 
t = fabsf(a); 
r = t; 

if (t > l.Of) r = l.Of / r; 
r = atan_poly(r); 

if (t > l.Of) r = fmaf ( 0 x 1 .ddcb02p-lf, 0 x 1 .aee9d6p+0f, -r); 
r = copysignf(r, a); 
return r; 

> 


FIGURE 5. N. Juffa’s arctan skeleton. This calls atan_poly from Figure [3] 

of acceptable values of /(x). This can be used to find approximations that make a given 
mathematical function implementation (including, say, an argument reduction step) more 
accurate. Consider the partially-specified implementation of arctan (due to N. Juffa [IT]) in 
Figure O 

Suppose one desires a binary32 approximation of arctan that is within 1.2 ulp of arctan 
everywhere. This can be cast as an approximation problem on [0,1]. For every binary32 
number x in [0,1], one wants 

• the straight-line program given by atanf_poly in [5] to yield a result within 1.2 ulp 
of arctan(x), and 

• for every y £ (1, oo) such that 1/y rounds to x, for the straight-line program given 
by atanf_poly in [5] to yield a result such that, after line 18 is run, r is within 1.2 
ulp of arctan(i/). 

One can find the interval of machine-representable y £ (1, oo) such that 1/y rounds to x by 
binary search. It is straightforward to write a function that computes, given x, the range of 
acceptable values of atanf_poly (x) under the above two conditions. 

The proposed heuristic finds the coefficients in Figure |6] in about ten minutes. These 
coefficients yield an error less than 1.1978 ulp everywhere. By way of comparison, Sollya’s 
polynomial yields a maximum error of more than 1.535 ulp. I also tried asking for a maximum 
error of 1.1 ulp, but the heuristic failed to find a solution. 


float 

float 

float 

float 

float 

float 

float 

float 


cl7 = Oxl.686c56p-9; 
cl5 = -0x1.01dec8p-6; 
cl3 = 0x1.5a901p-5; 
ell = -0xl.32b648p-4; 
c9 = 0x1,b3f558p-4; 
c7 = -0x1.22f90cp-3; 
c5 = 0x1.99782cp-3; 
c3 = -0x1.5554d8p-2; 


Figure 6 . Coefficients for Figure [5] 

9. Concluding remarks 

This paper presented a heuristic for designing floating-point approximations to univariate 
elementary functions. It takes as input a straight-line program structure, an interval, and a 
function giving the range of acceptable values for each input in the interval. If the heuristic 
succeeds, it outputs a straight-line program implementing the function on the desired interval 
that satisfies the desired error bound. 

This paper also presented two nontrivial examples of approximations to arctan. These 
produced binary32 numbers approximating the arctangent of binary32 numbers using 
binary32 arithmetic. 

For larger floating-point types such as IEEE binary64, it is not clear to me how to prove 
in a reasonable amount of time that a given straight-line program either satisfies the desired 
error bound on an interval or to fold an abscissa where it fails. Moreover, this approach 
might not scale to the higher-degree polynomials necessary for such approximations. I leave 
these considerations to future work. 
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